
e1071::svm(): Use formula interface only if factors are present #1740

Merged
merged 10 commits into master from fix_1738_svm_no_formula on Jun 17, 2019

Conversation

mb706
Contributor

@mb706 mb706 commented Mar 28, 2017

See if this fixes #1738

@mb706
Contributor Author

mb706 commented Mar 28, 2017

Note 1: You guys could systematically go through the learners and look for those that use the formula interface but could conceivably use the superior data.frame interface. Anecdotally, I have seen the shim that is e1071:::svm.formula many times before (it just extracts the data.frame from the formula and then uses that). EDIT: This turns out to be a bit more complicated, since the formula interface is used for dummy-encoding of factor features. It could still be done with some work. Things to look out for are parameters that somehow refer to columns (e.g. weighting or scaling of different features) and would need to be padded to the new number of columns, and whether or not an intercept column is added.

Note 2: I don't really like the test I implemented, it takes a relatively large amount of time and memory while really only testing for a design decision. Feel free to leave it out.

Note 3: If you think the test for many-feature data.frames is a good idea, it would be possible to systematically apply that test to all the learners that are supposed to handle many-feature situations.
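For illustration, the approach this PR converges on ("use the formula interface only if factors are present") could be sketched roughly like this. train_svm is a hypothetical wrapper for illustration only, not actual mlr code:

```r
library(e1071)

# Hypothetical wrapper illustrating the idea behind this PR: only go
# through the formula interface (which dummy-encodes factors via
# model.matrix) when factor features are actually present; otherwise
# pass the numeric matrix directly and skip the expensive shim.
train_svm <- function(data, target) {
  x <- data[, setdiff(names(data), target), drop = FALSE]
  if (any(vapply(x, is.factor, logical(1L)))) {
    # factors present: the formula interface handles dummy encoding
    svm(as.formula(paste(target, "~ .")), data = data)
  } else {
    # all-numeric features: the x/y interface avoids the formula
    # machinery and its memory blow-up on wide data
    svm(x = as.matrix(x), y = data[[target]])
  }
}
```

With many features and no factors, skipping the formula path avoids building terms objects and a model matrix, which is exactly the stack-overflow scenario from #1738.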

@mb706
Contributor Author

mb706 commented Mar 28, 2017

This turns out to be harder than I thought; see my comment in the issue.

@berndbischl
Sponsor Member

i would not advise merging this. see what i also posted in the issue. we should create a clean issue out of this.

@mllg
Sponsor Member

mllg commented Apr 3, 2017

The latest changes (only using the formula interface if factors are present) look good to me.

@larskotthoff
Sponsor Member

Yep, looks good to me.

@berndbischl
Sponsor Member

Yep, looks good to me.

how about commenting on the general issue i raised? this is more than "hey code looks good here".
i mean you can argue that we can implement this only for this learner as this is a clear improvement, but then please do so. and that we should possibly open up a more general issue for the general problem and not block this improvement.

also, does anybody see any PROBLEM that this might introduce? or is it just a clear enhancement?

@berndbischl
Sponsor Member

i mean, just note the possible side effects this has with PR #1763

what happens if we merge that too?
now the formula is taken ONLY if the task ALSO contains factors? pretty weird semantics, right?
or stated otherwise: does the svm then support the (now not existing) "formula" property...?

@larskotthoff
Sponsor Member

As I've said in the other thread, an alternative would be to have two separate learners for this. That would avoid issues with being able to specify formulas.

@berndbischl
Sponsor Member

As I've said in the other thread,

where exactly please?

@larskotthoff
Sponsor Member

#1738 (comment)

@berndbischl
Sponsor Member

#1738 (comment)

i am not a big fan of this. it is
a) confusing for users and overly complex to understand
b) it copies code A LOT.
c) the svm mlr learner should really do this internally

@giuseppec
Contributor

giuseppec commented Apr 5, 2017

I have to agree with bernd. We should not try to autodetect this and internally do an if-else here that changes the behaviour. I thought that our convention was to always use the formula interface whenever possible (e.g. randomForest). I think the user should be able to choose this himself.
Have to think about this a bit longer, but one possible solution could be to introduce a formula (and a data.frame) property and let the user decide what to use (e.g. via configureMlr, or as an option of the learner, or by checking if the task contains a formula and always preferring the formula interface if the learner supports it, otherwise using the data interface).

@giuseppec
Contributor

giuseppec commented Apr 6, 2017

What cases do we have?

  • the learner supports both, using a formula and data.frame (or matrix)
  • the learner supports only using a formula
  • the learner supports only using a data.frame (or matrix)
  • ???

Concrete suggestion:

  • add a new option prefer.formula.interface to configureMlr
  • default should be prefer.formula.interface = TRUE which always prefers the formula interface whenever possible (i.e. when the learner has a formula property which we have to add)
  • learners that allow using a formula should have a formula property

Consequences:

  • this way, we don't change the current behaviour and it should always be clear what mlr is doing (currently this is not even clear for users because it is not documented which learner uses which interface)
  • a user will be able to change the interface (and speed up) learners that also support using data.frames (or matrices)
  • contra: our tests will take longer as we have to test both options.
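The suggestion above could be dispatched roughly like this. This is a sketch only: prefer.formula.interface and the "formula" learner property are the proposal, not existing mlr API, and chooseInterface is a hypothetical helper name:

```r
# Sketch of the proposed dispatch. None of these names are real mlr
# API; they only illustrate the suggested configureMlr option and the
# proposed "formula" learner property.
chooseInterface <- function(learner, prefer.formula.interface = TRUE) {
  if (prefer.formula.interface && "formula" %in% learner$properties) {
    "formula"      # learner advertises formula support and the user prefers it
  } else {
    "data.frame"   # fall back to the plain data interface
  }
}
```

Keeping the default at prefer.formula.interface = TRUE matches the stated goal of not changing current behaviour while still letting users opt into the faster data interface.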

@mb706
Contributor Author

mb706 commented Apr 6, 2017

My opinion: I suggest not complicating the UI but instead making getTaskData also conform to the formula if one is given (see my comment on the PR).

Different solution: Use the formula interface if a user-specified formula was attached to the task (this would add another condition to the if clause), and use the data.frame directly otherwise.
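That second suggestion would amount to something like the following sketch (hypothetical: task$formula is illustrative and does not exist as an mlr task field; getTaskData with target.extra = TRUE is real mlr API):

```r
library(mlr)
library(e1071)

# Hypothetical: prefer the formula interface only when the user
# explicitly attached a formula to the task; otherwise pass the
# data.frame/matrix directly.
if (!is.null(task$formula)) {
  model <- e1071::svm(task$formula, data = getTaskData(task))
} else {
  # target.extra = TRUE returns list(data = features, target = y)
  d <- getTaskData(task, target.extra = TRUE)
  model <- e1071::svm(x = as.matrix(d$data), y = d$target)
}
```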

@larskotthoff
Sponsor Member

What's the status here? Have we decided on a course of action?

@giuseppec
Contributor

I think we haven't made a concrete decision yet. Do we even need this PR here which only solves this issue for SVM? Or do we want to solve this on an abstract level?
I still have this incomplete PR #1763 with some concrete TODOs where we could also try to address the issue mentioned here.

@pat-s
Member

pat-s commented Aug 2, 2017

pushing this discussion as it sets the base for some PRs now (#1899, #1763, #1740)

@giuseppec is again working on #1763, specifying the formula within the task.
Is this agreed on? I think it is not that practical, because then the task can only be used with models supporting this particular formula notation (correct me if I'm wrong?)

There must always be one place where the formula is specified manually. Why is the ParamSet of a learner (as in #1899) a bad place?

@giuseppec
Contributor

giuseppec commented Aug 2, 2017

  1. Yes, I started working on this again. But I am currently just playing around, still not finished ;-).
  2. No, we don't have an agreement yet. What I want to do: everything that currently works should of course still work, and I would like to add the possibility to specify the formula in the task (just optional). But I agree with you that this can then only be used for learners that support formula objects. However, as @mb706 suggested, we could extend getTaskData so that it uses only the features mentioned in the formula.
  3. I don't think that adding a formula to the learner-object is bad in general (I rather think that maybe both should be possible, specifying the formula in task AND in learner objects).
    Suppose you want to apply a learner to 2 or more different tasks (e.g. using the benchmark function). Setting the formula in the learner will only work for one task (as you need to add the feature names).
  4. I think it is very important that we do this stepwise (maybe multiple smaller PRs). A big PR has a high chance that it will never be merged.

@pat-s
Member

pat-s commented Aug 3, 2017

But I agree with you that this can then only be used for learners that support formula objects.

This would be ok if the same formula notation were accepted by multiple learners. However, in the case of mgcv::gam the s() notation is unique, I think (at least I do not know of any other smooth notation). So if we implemented it this way, we would need to create an extra task only for mgcv::gam.

@berndbischl
what is the main issue that a formula should not be in the paramset for specific learners?

@pat-s pat-s changed the title Fix 1738 svm no formula e1071::svm(): Use formula interface only if factors are present Jun 6, 2019
@pat-s
Member

pat-s commented Jun 6, 2019

Since the discussion about formula handling was ported to mlr3 and will probably not be solved here anymore (which blocked the merging of this PR), is there anything that would still speak against merging this SVM fix?

@pat-s
Member

pat-s commented Jun 17, 2019

@larskotthoff ready to merge?

@larskotthoff larskotthoff merged commit e6f045e into master Jun 17, 2019
@larskotthoff larskotthoff deleted the fix_1738_svm_no_formula branch June 17, 2019 17:39
vrodriguezf pushed a commit to vrodriguezf/mlr that referenced this pull request Jan 16, 2021
…org#1740)

* Testing svm with many features task

* svm use data.frame instead of formula

* spaces around match operator

* Only use svm data.frame interface if task is all numeric

* Deploy from Travis build 13884 [ci skip]

Build URL: https://travis-ci.org/mlr-org/mlr/builds/542175846
Commit: 5565287

* add NEWS entry

* Deploy from Travis build 13922 [ci skip]

Build URL: https://travis-ci.org/mlr-org/mlr/builds/546742364
Commit: de67d1a

Successfully merging this pull request may close these issues.

Stack overflow for a not so big task
7 participants